Despite more than 50 years of effort, 
the origin of the genetic code 
remains enigmatic. Among different 
theories, the stereochemical hypothesis 
suggests that the code evolved as a con- 
sequence of direct interactions between 
amino acids and appropriate bases. If 
indeed true, such a physicochemical foundation
of the mRNA/protein relationship
could also potentially lead to novel
principles of protein-mRNA interactions
in general. Inspired by this promise, we 
have recently explored the connection 
between the physicochemical properties 
of mRNAs and their cognate proteins at 
the proteome level. Using experimentally 
and computationally derived measures 
of solubility of amino acids in aqueous 
solutions of pyrimidine analogs together 
with knowledge-based interaction pref- 
erences of amino acids for different 
nucleobases, we have revealed a statisti- 
cally significant matching between the 
composition of mRNA coding sequences 
and the base-binding preferences of their 
cognate protein sequences. Our findings 
provide strong support for the stereochemical
hypothesis of the genetic code's origin
and suggest the possibility of direct
complementary interactions between 
mRNAs and cognate proteins even in 
present-day cells. 

Introduction 

The discoveries of the genetic code, 1 
mRNA, 2 and protein synthetic machin- 
ery, 3 together with the structure of 
DNA, 4 provided a powerful foundation 
for explaining information passage from
genes to proteins in atomistic and molecular
terms. However, our understanding
of the relationships between these major 
cellular biopolymers still lacks a strong 
evolutionary perspective. In particular, 
the physicochemical driving forces, which 
have led to the establishment of the funda- 
mental organization of information trans- 
fer in the cell, remain to be elucidated. 

A particularly pertinent problem 
in this regard is the interconnection 
between mRNA and proteins. Whereas 
the two are obviously related by the uni- 
versal genetic code, the principal building 
block of all life, its origin still remains one 
of the most important, yet unanswered 
foundational questions in biology. 5 What 
is more, the mRNA-protein relationship
extends significantly beyond just the
coding context, especially if one consid- 
ers the wide cellular interaction network 
contributing to the regulation of protein 
expression. For example, recent studies of 
protein-mRNA interactomes in eukaryotic
cells have revealed the surprising fact
that a number of proteins without recognizable
RNA-binding domains, including
several transcription factors and
metabolic enzymes, nevertheless display
mRNA-binding ability. 6-9 Such interactions
also include direct binding to cognate
mRNAs, which has over the years
been detected for a diverse set of proteins. 6,10-15
These findings open up the possibility
that protein-mRNA interactions
could be a more common phenomenon 
in the cell than previously thought, and 
invite formulation of novel, fundamental 
physicochemical principles behind such 
interactions. 
Figure 1. Matching of mRNA coding-sequence pyrimidine (PYR) profiles and their cognate protein
sequence polar-requirement (PR) profiles. (A) Distribution of Pearson correlation coefficients (R,
x-axis) between window-averaged PYR-content profiles of individual mRNA-coding sequences
and window-averaged PR sequence profiles of their cognate proteins for the human proteome 
(window size is 21 amino acids/codons). P (y-axis) corresponds to bin-size-normalized probabil- 
ity density. Inset: the median Rs for the human proteome obtained using protein PR sequence 
profiles and different nucleobase density mRNA profiles. (B) Comparison between mRNA PYR- 
content and their cognate protein PR profiles for three exemplary human proteins. Experimental 
PR scale 18 has been used for all comparisons presented here. Note that due to the definition of the 
PR scale, negative correlations indicate positive matching between the given content of mRNAs 
and the affinity for PYR analogs of their cognate proteins and vice versa. 



In two recent studies, we have presented
evidence of an intrinsic potential of
mRNAs and their cognate proteins to 
interact, which could simultaneously 
also explain fixation of the relationship 
between these two biopolymers in the 
modern genetic code. 16,17 More specifi- 
cally, we have compared nucleotide con- 
tent of naturally occurring mRNA coding 
sequences with the propensity of cognate 
protein sequences to interact with differ- 
ent nitrogenous bases. This was performed 
using both experimentally 18 and computa- 
tionally 19 derived polar requirement (PR) 
scales capturing the solubility of amino 
acids in aqueous solutions of pyrimidine
analogs, and knowledge-based interaction
preferences between amino acids and
RNA bases derived by analyzing binding 
interfaces in the known 3D-structures of 
protein-RNA complexes. Already in the
1960s and 1970s, Carl Woese and co-workers
showed that, depending on
type, amino acids exhibit different, clearly 
defined preferences for interacting with 
pyrimidine analogs (the PR scale). 18,20 
This was then used to support the stereo- 
chemical hypothesis for the origin of the 
genetic code, the idea that the specific 
pairing between individual codons and 
their cognate amino acids stems from 
direct binding preferences of the two for
each other. 5,21-25 At the same time, the
connection between the PR of individual 
amino acids and pyrimidine (PYR) con- 
tent of their codons had never been quan- 
titatively explored, and the same is true for 
amino-acid propensities to interact with 
other nucleobases. Even more so, the rela- 
tionship between the nucleobase density 
of mRNA-coding regions and the base- 
binding propensities of their cognate pro- 
tein sequences at the whole proteome level 
remained fully unexplored. 

Matching Between 
mRNA Composition and 
Nucleobase Binding Propensities 
of their Cognate Proteins 

We have studied this relationship for com- 
plete proteomes of 15 different organisms, 
five from each domain of life. 16 First, we 
have shown that the average PYR con- 
tent of mRNA coding sequences exhibits 
an extremely strong inverse correlation 
with the average PR of their cognate pro- 
tein sequences over complete proteomes. 
For example, the Pearson correlation 
coefficient R between the two variables 
approaches -0.9 for the human proteome. 
Since the PR scale assigns low values to 
amino acids with high propensity to inter- 
act with PYR analogs and vice versa, this 
means that the PYR content of mRNA 
coding sequences is directly proportional 
to the average affinity of their cognate pro- 
tein sequences for PYR-like compounds. 
Second, we have evaluated the level of 
matching between window-averaged 
PYR profiles of individual mRNA coding 
sequences and PR profiles of their cognate 
protein sequences and observed that the 
distributions of the thus-obtained Rs for 
all 15 species exhibit great similarity, with 
the average R typically in the vicinity of 
-0.7. To illustrate this finding, we present 
here the results of such a comparison for 
the human proteome using the original 
experimental PR scale derived by Woese 
and co-workers 18 (Fig. 1A). The distribu- 
tion of Pearson Rs obtained by compar- 
ing individual mRNA PYR profiles and 
their cognate protein PR profiles is promi- 
nently shifted toward strong negative cor- 
relations with the median value of -0.69. 
Importantly, the PYR density profiles 
of mRNA display stronger correlations 
with PR profiles of their cognate protein 
sequences than any other type of mRNA 
compositional profiles (Fig. 1A, inset). As 
shown in our original study, 16 these corre- 
lations further improve by several percent- 
age points if one uses a computationally 
derived PR scale, albeit with no impact on 
any qualitative conclusions. 
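
The profile comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the original study: the function names (codon_fraction, window_average, pearson_r, profile_correlation) are hypothetical, the amino-acid scale is supplied by the caller, and the two-letter scale in the usage lines is a toy stand-in, not the actual PR scale.

```python
from math import sqrt
from statistics import mean


def codon_fraction(cds, bases="CU"):
    """Fraction of the given bases (default: pyrimidines) in each codon."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    return [sum(b in bases for b in rna[i:i + 3]) / 3
            for i in range(0, usable, 3)]


def window_average(values, window=21):
    """Sliding-window mean; yields len(values) - window + 1 points."""
    if len(values) < window:
        raise ValueError("sequence is shorter than the averaging window")
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a flat profile has no defined correlation")
    return sxy / sqrt(sxx * syy)


def profile_correlation(cds, protein, scale, bases="CU", window=21):
    """Correlate a window-averaged mRNA base-content profile with a
    window-averaged amino-acid scale profile of the encoded protein.
    The coding sequence is expected without its stop codon."""
    if len(cds) // 3 != len(protein):
        raise ValueError("coding sequence and protein lengths do not match")
    nuc = window_average(codon_fraction(cds, bases), window)
    aa = window_average([scale[res] for res in protein], window)
    return pearson_r(nuc, aa)


# Toy usage: values below are illustrative only, NOT the PR scale.
toy_scale = {"F": 1.0, "K": 10.0}
toy_protein = "FFFKKKFFKF"
toy_cds = "".join("UUU" if res == "F" else "AAA" for res in toy_protein)
# The toy profiles are perfectly anti-correlated, so r is -1 up to rounding.
r = profile_correlation(toy_cds, toy_protein, toy_scale, window=3)
```

Passing bases="AG" instead of the default gives the corresponding purine-content profile, and any per-residue scale can be substituted for the toy dictionary.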

What do these correlations mean at the 
level of individual mRNA/protein pairs? 
In Figure 1B we present PYR mRNA
density profiles aligned with PR profiles of
their cognate proteins for three select cases
where the two display strong correlation:
a membrane protein (potassium channel
KCNQ1), a cytosolic globular protein
(hemoglobin α-subunit), and a significantly
unstructured nuclear/cytosolic protein
(tumor suppressor p53). While these
three well-known proteins were purpose- 
fully chosen to illustrate a high level of 
matching, one should emphasize that even 
the median level of correlation (-0.69) cor- 
responds to profiles with strong matching, 
as discussed in more detail in our original 
work. As one can see in these examples, 
the mRNA PYR densities quantitatively 
mirror sequence profiles of cognate pro- 
teins capturing their affinity for PYR ana- 
logs. In other words, PYR-rich regions in 
mRNAs directly correspond to the regions 
in their cognate proteins with high affin- 
ity for PYR analogs and vice versa. What 
is more, the universal genetic code appears 
to be highly optimized with respect to 
maximizing this matching: for more than 
half of the proteomes tested, the native 
genetic code outperforms each and every 
one of the 10^6 randomized codes tested. 16 

These compelling results notwith- 
standing, the PR scale still leaves several 
important questions unanswered. First, 
while this scale captures amino-acid 
affinities for pyrimidine-like compounds, 
it remains silent about the potentially 
equally important purine (PUR) affini- 
ties. Second, substituted pyridines, as used 
in deriving the PR scales, are only proxies 
for the real biologically relevant nucleo- 
bases. How does the picture change if one 
looks at amino-acid propensities to bind 
uracil, cytosine, guanine, or adenine? To 
address these questions, we have derived 
interaction preferences between differ- 
ent amino acids and RNA nucleobases 
by focusing on binding interfaces in the
known 3D structures of protein-RNA
complexes (approximately 300 high- 
resolution X-ray and NMR structures, 
including five ribosomal structures). 17 In 
the process, we have isolated sequence- 
specific protein-RNA contacts where each 
amino-acid side chain is surrounded by 
more than one RNA base, while ignoring 
non-specific interactions defined solely by 
protein or RNA backbones. The scales of 
amino acid/nucleobase interaction prefer- 
ences were derived by employing standard 
distance-independent contact potential 
formalism 26-30 and allowed us to explore 
the connections between composition 
of mRNA codons and the preferences 
of amino acids to interact with different 
nucleobases at protein/RNA interfaces 
(see also the section below). Moreover, 
this approach also allowed us to examine 
analogous matching for sequence profiles 
of complete mRNA coding sequences and 
their cognate proteins. Strikingly, we have 
observed a strong correlation between 
preferences of amino acids to interact with 
guanine (G-preference) and the average 
PUR-content of their respective codons 
(R = -0.84). 17 In other words, amino
acids that are predominantly encoded
by PUR bases display a strong tendency
to co-localize with G at protein-RNA
interfaces and vice versa. Consequently, 
we have also found that sequence profiles 
of protein G-preferences strongly match 
PUR density in cognate mRNAs (median 
R = -0.80 for the human proteome, 
Fig. 2A) and that this effect is stronger 
than anything one sees with other possible 
amino-acid/nucleobase interaction prefer- 
ences (Fig. 2A, inset). A similar, but some- 
what weaker signal was observed when it 
comes to mRNA PYR or PUR profiles 
and sequence preferences of their cog- 
nate proteins to interact with pyrimidine 
(PYR-preference) or purine bases (PUR- 
preference), respectively. 17 Most notably, 
for adenine (A-preference) this correlation
is significant, but reversed (median R
= 0.53 for the human proteome, Fig. 2A): 
protein sequence regions encoded by PUR- 
rich mRNA sequence stretches prefer not 
to interact with adenines and vice versa, 
and the same is true for A-rich mRNA 
sequences. 17 
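
For readers unfamiliar with contact potentials, a distance-independent preference of an amino acid for a nucleobase is commonly expressed as a log-odds ratio between the observed number of contacts and the number expected if contacts formed at random. The sketch below assumes a simple quasi-chemical reference state, e(a,b) = -ln(N_ab * N_total / (N_a * N_b)); the reference state and contact definition used in the original work may differ, and the function name and all counts are hypothetical. With this sign convention, lower values mean a stronger preference, which is consistent with negative correlations indicating positive matching.

```python
from math import log


def contact_preferences(counts):
    """Turn observed contact counts into log-odds preferences.

    counts maps (amino_acid, base) -> number of observed contacts.
    Returns e[(amino_acid, base)] = -ln(observed / expected), where the
    expected count assumes amino acids and bases pair at random
    (an assumed reference state, chosen here for illustration).
    Pairs with zero observed contacts are skipped as undefined.
    """
    aa_totals, base_totals = {}, {}
    total = 0
    for (aa, base), n in counts.items():
        aa_totals[aa] = aa_totals.get(aa, 0) + n
        base_totals[base] = base_totals.get(base, 0) + n
        total += n

    prefs = {}
    for (aa, base), observed in counts.items():
        if observed == 0:
            continue
        expected = aa_totals[aa] * base_totals[base] / total
        prefs[(aa, base)] = -log(observed / expected)
    return prefs


# Hypothetical counts, for illustration only.
toy_counts = {("R", "G"): 30, ("R", "A"): 10,
              ("L", "G"): 10, ("L", "A"): 30}
toy_prefs = contact_preferences(toy_counts)
# ("R", "G") is over-represented, so its value is negative (favorable);
# ("R", "A") is under-represented, so its value is positive.
```

Collecting the values for one base across all 20 amino acids gives a per-residue scale of the kind that can be turned into a protein sequence profile.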

Using the same mRNA/protein 
pairs as in Figure 1, we demonstrate the
close matching between mRNA PUR
density profiles and knowledge-based 
G-preference profiles of their cognate pro- 
tein sequences (Fig. 2B), which for these 
three proteins actually exceeds the match- 
ing seen for mRNA PYR density profiles 
and their cognate proteins' PR profiles. 
Taken together, the results of both of 
our studies clearly demonstrate a strik- 
ing level of mirroring between PYR- or 
PUR-density profiles of mRNA coding 
sequences and base-binding-preference 
profiles of their cognate proteins, inde- 
pendently of how the respective amino- 
acid preferences are obtained (e.g., an 
experimental or computational PR scale, 
and a number of knowledge-based scales). 
These observations have allowed us to pro- 
pose a novel and potentially wide-reaching 
hypothesis that mRNA-coding regions 
may be physicochemically complementary 
to the respective cognate protein regions 
and bind, especially if the complementary 
segments are available for interaction such 
as in the case of unstructured mRNA and 
protein stretches. In fact, we would like to 
propose that such direct complementary 
binding interactions may be a key element 
underlying the whole mRNA/protein rela- 
tionship when it comes to both its evolu- 
tionary development as well as present-day 
biology. 16,17 

Correlations in the 
Genetic Code Reveal its 
Physicochemical Nature 

Knowledge-based amino-acid interaction 
preference scales allowed us also to test 
the physicochemical relationship between 
amino acids and the composition of their 
codons. This is especially important in the 
stereochemical perspective of the origin of 
the genetic code. 5,21-25 For this purpose, we 
have calculated correlations between the 
average codon composition in the human 
proteome and the respective amino-acid 
interaction preferences. We divide amino 
acids into two equally sized subsets: (1) 
amino acids, which can be obtained abi- 
otically in similar conditions as in the 
classic Miller-Urey experiments, 31 and 
(2) amino acids whose synthesis requires 
more complicated biochemical path- 
ways. 5 According to the coevolution the- 
ory, 32,33 amino acids in the first subset are 
Figure 2. Matching of mRNA coding-sequence purine (PUR) density profiles and protein
sequence knowledge-based interaction preference profiles. (A) Distribution of Pearson correlation
coefficients (R, x-axis) between mRNA PUR profiles and G- (blue) or A-preference (magenta)
sequence profiles of human proteins. P (y-axis) corresponds to bin-size-normalized probability
density. Inset: the median Rs for the human proteome obtained using PUR mRNA profiles and
different knowledge-based interaction preference scales for cognate protein sequences. (B) Comparison
between mRNA PUR density profiles and their cognate protein G-preference profiles for
the same three exemplary human proteins as given in Figure 1B. Note that due to the definition
of knowledge-based scales, negative correlations indicate positive matching between the given 
content of mRNAs and the binding preferences of their cognate proteins and vice versa. 



evolutionarily "old," in the sense that they 
were utilized during code evolution from 
prebiotic synthesis, while the second type 
of amino acids entered the code by means 
of biosynthesis from the "old" amino 
acids or were introduced into proteins 
through post-translational modifications. 
Interestingly, we observe relatively strong 
correlations between G-binding prefer- 
ences and G-content of codons for the first 
type of amino acids (Fig. 3A), and simi- 
larly so for their C-binding preferences 
and C-content of codons. On the other 
hand, these correlations are weak for the 
"new" amino acids and also decrease when 
it comes to all 20 amino acids (Fig. 3C). 
In contrast, strong anti-correlation in the 
case of A is observed for the "new" amino
acids (Fig. 3B) and it remains so over all
20 (Fig. 3C). In other words, the G- and 
C-content of codons for biosynthetically 
more primitive amino acids from the first 
set is strongly related to their tendency to 
preferentially co-localize with G and C 
at protein-RNA interfaces, respectively. 
Conversely, the A-content of biosyn- 
thetically more complex amino acids is 
inversely related to their A-binding prefer- 
ences at protein-RNA interfaces. 
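
The codon-side quantity in these comparisons, the average content of a given base in the codons of each amino acid, can be computed directly from the standard genetic code. The sketch below is illustrative only: the function name codon_base_content and the optional codon_counts weighting are assumptions, and the proteome-wide codon usage and the knowledge-based preference values themselves are not reproduced here.

```python
BASES = "UCAG"
# Standard genetic code, codons enumerated with U, C, A, G at each position
# (UUU, UUC, UUA, UUG, UCU, ...); "*" marks stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_base_content(base, codon_counts=None):
    """Average fraction of `base` in the codons of each amino acid.

    codon_counts optionally maps codon -> usage count (for example,
    counts over a whole proteome); without it, every codon of an amino
    acid is weighted equally. Amino acids with zero total weight are
    left out of the result.
    """
    weighted, weights = {}, {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        w = 1.0 if codon_counts is None else codon_counts.get(codon, 0)
        weighted[aa] = weighted.get(aa, 0.0) + w * codon.count(base) / 3
        weights[aa] = weights.get(aa, 0.0) + w
    return {aa: weighted[aa] / weights[aa]
            for aa in weights if weights[aa] > 0}


g_content = codon_base_content("G")  # unweighted G-content per amino acid
a_content = codon_base_content("A")  # unweighted A-content per amino acid
```

Correlating such a dictionary with a per-residue preference scale over a chosen subset of amino acids gives one coefficient of the kind summarized in Figure 3C.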

Interestingly, the codons of the "old" 
amino acids are enriched in G and C as 
compared with the "new" amino acids 
(the average G/C content of 0.58 vs. 0.41, 
respectively, P value = 0.097, Welch two-sample
t-test), which was also noticed
before. 34 Summarizing these facts, we can
speculate that a stereochemical connection
could have played a role in shaping the 
genetic code in the early stages by provid- 
ing a mapping between prebiotic amino 
acids and G/C-rich codons, whereas 
engagement of new amino acids required 
more of A and U to be included in the 
codons. The latter could have been driven 
by other factors, such as evolution of meta- 
bolic pathways 32,33 or optimization of code 
stability, 35,36 which also further reduced 
the direct correspondence between codon 
content and amino-acid affinities. Our 
findings may also explain in part some 
of the difficulties in trying to prove the 
direct stereochemical nature of the genetic 
code 23-25: for example, our results suggest
that some amino acids (especially
evolutionarily more recent ones whose 
anticodons are A-rich, such as phenylala- 
nine) may also exhibit appreciable affin- 
ity for their anticodons. This suggestion 
agrees with the findings of Johnson and 
Wang who have detected significant co- 
localization of a number of evolutionarily 
more recent amino acids and their cog- 
nate anticodons in ribosomal structures. 25 
What is more, following similar previous 
suggestions, 5 our findings underline the 
fact that only a combination of different 
hypotheses could give a complete view of 
the origin of the genetic code. At the same 
time, our results give strong support to the 
possibility of direct templating of proteins 
from mRNA in the era before the development
of ribosomal decoding, and of the code's
fixation in that era. 22,37 This scenario, first
put forth by Carl Woese, suggests that 
amino acids in ancient systems associ- 
ated with mRNA directly in the course of 
translation, following their intrinsic physi- 
cochemical propensities. Relating this to 
our findings, the ancient physicochemical 
background of the genetic code appears 
to still allow G- or PYR-affinity of amino 
acids in present-day protein sequences to 
closely follow PUR/PYR density profiles, 
respectively, of their cognate mRNAs. 
Finally, direct binding of mRNAs to pro- 
teins they code for is also consistent with 
the proposal that the principal function 
of ancient proteins during the transition 
from the RNA world was indeed struc- 
tural stabilization of RNA molecules. 37 
In line with this, our results suggest that 
strong binding complementarity may 
Figure 3. Connection between knowledge-based nucleobase-binding preferences of amino acids and the base content of their cognate codons. (A)
Correlation between G-interaction preferences of amino acids from the Miller-Urey experiment 31 and the average G-content of their codons in mRNAs of
the entire human proteome. (B) Correlation between A-interaction preferences of other amino acids and the average A-content of their codons. (C) 
Correlation coefficients (R) between different amino-acid interaction preferences and the respective compositions of their codons. Amino acids are 
grouped into three subsets, which are colored according to the legend. Note that due to the definition of knowledge-based scales, negative correla- 
tions indicate positive matching between the given content of codons and the binding preferences of their cognate amino acids and vice versa. 



exist predominantly at the level of longer 
mRNA and polypeptide stretches rather 
than individual amino acids and codons. 

Potential Roles of 
Cognate Interactions 
in Present-Day Biology 

Arguably the strongest experimentally 
testable prediction of our complementary 
binding model is that a large fraction of 
proteins will directly bind to their cognate 
mRNAs in an in-frame manner, especially 
if both molecules are structurally destabi- 
lized. 16,17 Which processes in present-day 
biological systems could depend on such 
binding? Given that one observes comple- 
mentary matching mainly at the level of 
primary sequences of the two biopolymers, 
one should first consider all the contexts 
in which both polymers are significantly 
unstructured. However, we do not exclude 
the possibility of direct cognate bind- 
ing according to the same physicochemi- 
cal principles even if both molecules are 
largely structured. Namely, it remains 
to be explored how putative binding hot
spots at the level of primary sequence profiles
map onto folded mRNA and protein
secondary and tertiary structures. It is pos- 
sible that the binding potential, suggested 
by primary-sequence profiles, remains 
present even if both molecules adopt well- 
defined folds. What is more, the fact that 
adenines behave qualitatively differently 
from guanines, as discussed above, points 
at additional complexities that still need to 
be fully understood. 

In principle, all scenarios where 
mRNAs and/or proteins exist as partially 
extended polymers might be affected by 
cognate interactions of the kind described 
above, such as, for example, during splic- 
ing of pre-mature mRNAs, protein secre- 
tion, or translocation. However, probably 
the most obvious process where binding 
of mRNAs to cognate proteins could 
have a regulatory role is translation. Here, 
the newly synthesized proteins would 
repress the translation of their own mes- 
sage by directly or indirectly competing 
with the binding between ribosome and 
mRNA and forming a negative feedback
loop. From thymidylate synthase
to dihydrofolate reductase to different
ribosomal proteins, there are a number of
known examples where precisely such
regulation takes place. 6,10-15,38 It will be
interesting to study the parallels as well as 
discrepancies between our proposal and 
the specific binding mechanisms, which 
have been proposed in some of these 
cases. For example, cognate binding in 
this context is known to occur not only in 
mRNA coding sequences, but also in their 
untranslated regions (UTRs). An impor- 
tant frontier in this regard will be to study 
potential profile complementarity even 
between non-coding RNA segments and 
different protein profiles. Finally, it will 
be interesting to explore whether cognate 
binding-based translational regulation 
may be a more widespread phenomenon 
than typically considered, as suggested by 
our model. 

Another context in which cognate 
interactions may also play a natural role is 
viral assembly. After all, a loaded viral cap- 
sid can be thought of as one of the simplest 
entities in which genetic message resides 
in close proximity of its own product. 
Recently, Stockley and co-workers have 
used single-molecule fluorescence correla- 
tion spectroscopy to show that specific and 
direct binding of capsid proteins to single- 
stranded RNA of bacteriophage MS2 
and satellite tobacco necrosis virus plays 
a key role in RNA compaction and cap- 
sid assembly and packaging. 39,40 It will be 
interesting to examine to what extent the 
binding mechanisms seen in these systems 
agree with our proposal. What is more, 
analogous principles may in fact also be 
applied to single-stranded DNA viruses. 

Finally, the mRNA-protein com- 
plementarity hypothesis may also be 
intimately related to the structure and 
function of ribonucleoprotein complexes 
and RNA-protein granules. A number of 
recent studies have reported and charac- 
terized instances of phase separation in the 
cytoplasm 41 and nucleoplasm, 42 a process 
analogous to lipid-raft formation in the 
membrane, which in aqueous compart- 
ments results in the formation of liquid 
droplets. These droplets define specific, 
non-membrane-bound compartments 
typically rich in proteins and RNA (such 
as nucleoli, germline P granules, P-bodies, 
stress granules, Cajal bodies, etc.), 43 and 
are in many cases known to be the sites 
of mRNA storage, processing, and decay. 
Moreover, it has also been suggested that 
both multi-valency (i.e., many potential 
binding sites) and presence of low com- 
plexity (i.e., disordered) regions in proteins 
may be important factors in the formation 
of such phase-separated compartments. 43 
It is tempting to relate these observations 
to the mRNA-protein complementar- 
ity hypothesis, whereby transient cognate 
mRNA-protein interactions would pro- 
vide the necessary driving force for such 
compartmentalization. Twenty years ago, 
Kyrpides and Ouzounis suggested that 
cognate protein-mRNA binding inter- 
actions may represent an ancient mecha- 
nism through which mRNA stability is 
auto-regulated. 10 Although their model 
of such binding lacked specific mechanis- 
tic details, they saw it as an extension of 
the stereochemical hypothesis at the level 
of complete polymers, a view now cor- 
roborated by our microscopically detailed 
model. Importantly, in their theoretical 
scheme, untranslated or excess mRNAs 
pair up with their cognate proteins and
are subsequently degraded. One may speculate
that the above compartments, such
as P-bodies or stress granules, may be the 
actual sites in which such interactions take 
place. In our picture, being disordered 
should facilitate interactions between 
mRNAs and cognate proteins, but this 
might also promote degradation. What 
is more, such destabilization might actu- 
ally be a consequence of different stresses 
(e.g., heat shock), which also fits well with 
the role of P-bodies and stress granules in 
general. 

To sum up, our recent findings con- 
cerning the close inter-dependence 
between compositional profiles of mRNAs 
and their cognate proteins provide a novel 
framework for analyzing interactions 
between proteins and not just mRNAs, 
but rather nucleic acids in general. 
Namely, all of the findings and principles 
presented here could also be generalized to 
other types of RNA molecules in present- 
day systems, including long non-coding 
RNAs, as well as DNA. However, it should 
be emphasized that cognate interactions of 
the kind discussed here could have been 
functionally relevant only in ancient cells 
and it could very well be that in the course 
of evolution they were replaced by more 
specific and more efficient mechanisms. 
Alternatively, as we would like to suggest, 
it is possible that they actually became a 
foundation on top of which these more 
specific mechanisms operate. We hope our 
results and suggestions will stimulate the 
field to explore these exciting possibilities. 